datasets with different amounts of each feature. After that, we utilized five machine learning (ML)
algorithms and evaluated their performance using seven performance evaluation metrics for both augmented
and non-augmented datasets. The same procedure was performed on both datasets to test how well the
classifiers work. Logistic regression (LR) is the most accurate method for the top nine features in the
augmented dataset (90.77%).

Keywords: K-means cluster, K-nearest neighbor, Logistic regression

1. INTRODUCTION

Bangladesh's economy depends heavily on agricultural production, and agriculture is a major driving
force behind economic progress in a developing country like Bangladesh. According to a World Bank report,
more than 39% of all jobs in Bangladesh are in agriculture [1]. Agriculture also contributes 12.92% of the
country's gross domestic product (GDP) [2]. Diseases affecting plants are a significant source of economic
losses in the agricultural sector. Therefore, it is critical to detect plant diseases early by observing their
outward manifestations, before the infection can spread to healthy plants.

Cauliflower (scientific name Brassica oleracea) has undergone many genetic changes and is now
grown on every continent. China, India, the US, Spain, Mexico, and Bangladesh cultivate cauliflower
commercially. Bangladesh produces 73,000 metric tons of cauliflower annually on 9,400 acres [3].
According to the aggregate nutrient density index (ANDI) score, which measures how many vitamins,
minerals, and phytonutrients a food contains, cauliflower is among the top 10 most nutrient-dense foods [4].
Cauliflower is a nutritional powerhouse with high levels of vitamins C and K [5] and can be eaten both
cooked and raw in salads and relishes. Diseases can impede cauliflower's growth, lowering its quality and
yield. Traditional methods for diagnosing cauliflower infections are arduous, time-consuming, and costly,
making them unfeasible for large-scale farming operations. Farmers in less developed nations or in rural
Bangladesh may need to travel to meet with professionals.

This study addresses the application of machine learning (ML) to recognize and predict cauliflower
leaf classes: black rot, downy mildew, bacterial spot, and disease-free (fresh) leaves. Our framework is a
cloud-based, ML-powered platform that uses mobile images as input. K-means clustering is used to segment
diseased samples. Then, five classifiers were trained and assessed for disease recognition, and seven
metrics were used to evaluate the algorithms. The main contributions of this work are: i) recognizing
cauliflower disease early and in an automated way; ii) extracting features from the collected images with
the gray-level co-occurrence matrix (GLCM) and ranking them by their mutual information scores;
iii) systematically organizing the most reliable features for training and testing classifiers; and
iv) accurately predicting cauliflower disease from image data.


2. BACKGROUND STUDY 

Plant disease is a significant problem in the agriculture sector. Numerous investigations have been
undertaken to detect diseases in crops such as apples, rice, and cauliflower. Although much work has been
done on diagnosing cauliflower diseases, more must be done to improve the results. Sasirekha and Suganthy [6]
suggested a k-means clustering algorithm for carrot disease. Their study employed k-means clustering to
segment images, and GLCM features of the affected region were used to determine the standard deviation,
inverse difference moment (IDM), entropy, root mean square (RMS), smoothness, variance, contrast, skewness,
kurtosis, and correlation. A support vector machine (SVM) was used to classify carrot diseases. However, the
authors did not report the classification accuracy of their model. Sari et al. [7] proposed an agro-medical
method to identify papaya disease. For training and validation, they used 50 papaya trees. They validated the
model with a flexible Naive Bayes classifier (FNBC) and compared it to the forward chaining technique.
With FNBC, the validation accuracy was 88%, while it was 90% with forward chaining. Because this study
used specific data and forward chaining handles only small amounts of data, its validity and accuracy may be
questioned when managing massive amounts of information. Gu et al. [8] used hyperspectral imaging and
ML to determine early on whether a disease affected tomatoes. ML algorithms (boosted regression tree (BRT),
classification and regression tree (CART), random forest (RF), and SVM) were used to find and confirm
diseases. Switched parasitic array (SPA) and genetic algorithm (GA) were used to design environments. In this
study, the BRT achieved an accuracy of 85.2% and an area under the curve (AUC) of 0.932.

Chaudhary et al. [9] proposed the EnsPSO technique, which combines voting, the particle swarm
optimization (PSO) algorithm, the correlation-based feature selection (CFS) method, and random sampling.
Its goal is to make it easier to find agricultural diseases. They used three datasets to train and test the
proposed approach and the voting method. The EnsPSO method was more accurate than Vote (96%).
Kianat et al. [10] developed a PDbE-based method for diagnosing cucumber leaf diseases that makes use of
both a feature reduction method and a robust feature selection method. With 900 samples from six classes,
quadratic SVM, cubic SVM, linear discriminant analysis (LDA), and linear SVM models were created. For
this strategy, quadratic SVM was the best-fit model (93.5% accuracy). Islam et al. [11] constructed a novel
ML-based papaya disease detection system. They used 214 samples from an online dataset to build their
model and used RF, k-means, SVM, and a convolutional neural network (CNN). The CNN's accuracy was
98.04%. Habib et al. [12] suggested a machine vision-based papaya disease diagnosis. They used an existing
dataset with five diseases but excluded the areas. Two feature selection strategies and three ML classifiers
were employed to identify papaya diseases. The SVM classifier's accuracy was 95.2%. Panigrahi et al. [13]
applied ML to maize disease identification. Using an online dataset, they applied Naive Bayes (NB),
decision tree (DT), k-nearest neighbor (KNN), SVM, and RF classifiers to classify the disease. They obtained
79.23% accuracy, which was substantially lower than in other work. Rajbongshi et al. [14] suggested an
ML-based cauliflower disease detection method. To conduct this study, 766 disease images were used.
K-means clustering was used to segment the diseased areas, and BayesNet, Kstar, RF, logistic model tree
(LMT), back propagation neural network (BPN), and J48 were used to classify the diseases. The RF classifier
scored 89% in this study. Methun et al. [15] presented a deep learning technique for carrot disease. Their
experimental CNN approach used the VGG16, VGG19, MobileNet, and Inception v3 models. Inception v3
had the highest accuracy, 97.4%.

In the studies reviewed above, the suggested methods were applied to the original data only. In
contrast, we applied our model to both augmented and nonaugmented data. We also applied the GLCM
feature extraction method to identify regions of interest in the image data. The accuracy of our model might
improve further if different feature selection techniques were used to choose features.


3. METHOD 

This section outlines the several methods utilized to implement ML-based cauliflower disease 
identification. Our approach comprises three main parts: the overall architecture, extracting features from the 
collected images, and using an ML-based strategy to find diseases in cauliflower. This method's justification 
and implementation are elaborated upon below. 

In our proposed framework, images depicting cauliflower diseases are captured by smartphones and
processed by an online ML-based system. Figure 1 illustrates the overall architectural design of our proposed
machine vision-based expert system. First, users install the envisaged expert system app and capture
images using their devices; the images are then uploaded as input through the application. When the
analysis is finished, the result is sent to the user through SMS, and the user can view the outcome.




Figure 1. The system architecture for the recognition of cauliflower diseases using machine vision 


3.1. Image collection 

This dataset was collected by the authors and is already available in Data in Brief [16]. It
includes a total of 1,920 pictures, divided into four categories: bacterial spot, downy mildew, black rot, and
disease-free. Data augmentation was applied so that the model could be trained on a more extensive dataset.
Additionally, the performance of this model is compared to that of the model trained without data
augmentation. The amount of data before and after augmentation is presented in Table 1.


Table 1. Dataset overview 


Class            Original data   Augmented data
Bacterial spot   173             300
Downy mildew     170             312
Black rot        160             280
Disease-free     205             320
Total            708             1,212


3.2. Preprocessing 

To use the collected images effectively, it is essential to resize them to common dimensions, as
they all differ in size. To begin with, the images were resized to a fixed size of 224x224 using bicubic
interpolation. Assume that the function values $f$ and the derivatives $f_x$, $f_y$, and $f_{xy}$ are known
at the four corners (0, 0), (1, 0), (0, 1), and (1, 1) of a unit square, and that $m_{ij}$ denotes the
coefficients. The interpolation surface [11] is defined as (1):


$$f(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} m_{ij}\, x^{i} y^{j} \tag{1}$$
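
As a small illustration of (1), the sketch below evaluates the bicubic surface for a given 4x4 coefficient table. The function name `bicubic_surface` and the example coefficients are hypothetical and chosen only for illustration; in practice the coefficients would be solved from the corner values and derivatives.

```python
def bicubic_surface(m, x, y):
    """Evaluate f(x, y) = sum over i, j in 0..3 of m[i][j] * x**i * y**j."""
    return sum(m[i][j] * (x ** i) * (y ** j) for i in range(4) for j in range(4))


# Hypothetical coefficients for illustration only: f(x, y) = 1 + 2x + 3xy.
m = [[0.0] * 4 for _ in range(4)]
m[0][0], m[1][0], m[1][1] = 1.0, 2.0, 3.0

value_at_centre = bicubic_surface(m, 0.5, 0.5)  # 1 + 2*0.5 + 3*0.25
```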


3.2.1. Histogram equalization to increase contrast
Histogram equalization is used to intensify image contrast. Let X and Y denote the number of rows
(height) and columns (width) of the image in pixels, let $n_k$ be the number of pixels with color intensity
$C_k$, and let I be the number of intensity levels in the image. The processed image [11] is defined as (2),
where each pixel with color intensity $C_k$ in the input is mapped to a pixel with color intensity $S_k$.


$$S_k = T(C_k) = \frac{I - 1}{X \cdot Y} \sum_{j=0}^{k} n_j \tag{2}$$
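
The mapping in (2) can be sketched in a few lines of Python. This is a minimal illustration that assumes a single-channel image stored as a list of rows of integer intensities; the function name `equalize` and the choice to round down to an integer level are assumptions, not details taken from the paper.

```python
def equalize(image, levels=256):
    """Histogram-equalize a grayscale image given as a list of rows of ints."""
    rows, cols = len(image), len(image[0])

    # n_j: number of pixels at each intensity level.
    counts = [0] * levels
    for row in image:
        for intensity in row:
            counts[intensity] += 1

    # S_k = (levels - 1) / (rows * cols) * sum_{j <= k} n_j, rounded down.
    mapping, running = [], 0
    for k in range(levels):
        running += counts[k]
        mapping.append((levels - 1) * running // (rows * cols))

    return [[mapping[intensity] for intensity in row] for row in image]


# Tiny 2x2 example with four intensity levels.
stretched = equalize([[0, 0], [1, 3]], levels=4)
```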


3.2.2. Convert the colour space from RGB to L*a*b*

The k-means clustering algorithm is a form of unsupervised ML. For better segmentation, each
contrast-enhanced RGB image is converted into the L*a*b* color space before clustering. The conversion
leaves the image content unchanged, which makes L*a*b* a suitable choice here. The image is first converted
from RGB to the CIE XYZ color space and then from XYZ to L*a*b*. The first step uses (3), following [11]:


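$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 3.2405 & -1.5372 & -0.4985 \\ -0.9692 & 1.8759 & 0.0416 \\ 0.0556 & -0.2040 & 1.0573 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \tag{3}$$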


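The $X_n$, $Y_n$, and $Z_n$ values of the reference white are used to calculate the L*a*b* color space.
More information [11] is available in (4):


$$f(t) = \begin{cases} t^{1/3} & \text{if } t > 0.008856 \\ 7.787\,t + \frac{16}{116} & \text{if } t \le 0.008856 \end{cases} \tag{4}$$

L*, a*, and b* can then be calculated using (5)-(7):

$$L^{*} = \begin{cases} 116 \left( \frac{Y}{Y_n} \right)^{1/3} - 16 & \text{if } \frac{Y}{Y_n} > 0.008856 \\ 903.3 \left( \frac{Y}{Y_n} \right) & \text{if } \frac{Y}{Y_n} \le 0.008856 \end{cases} \tag{5}$$

$$a^{*} = 500 \left( f\left( \frac{X}{X_n} \right) - f\left( \frac{Y}{Y_n} \right) \right) \tag{6}$$

$$b^{*} = 200 \left( f\left( \frac{Y}{Y_n} \right) - f\left( \frac{Z}{Z_n} \right) \right) \tag{7}$$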
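
The conversion chain in (3)-(7) can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes linear RGB values in [0, 1] (gamma correction is omitted) and uses the commonly tabulated sRGB-to-XYZ (D65) matrix, because the coefficients printed in (3) appear to correspond to the inverse (XYZ-to-RGB) direction. The reference white is taken as the row sums of that matrix, and the names `RGB_TO_XYZ`, `WHITE`, `f`, and `rgb_to_lab` are illustrative.

```python
# Assumed forward matrix: commonly tabulated linear sRGB -> CIE XYZ (D65).
RGB_TO_XYZ = (
    (0.4124, 0.3576, 0.1805),
    (0.2126, 0.7152, 0.0722),
    (0.0193, 0.1192, 0.9505),
)
# Assumed reference white (Xn, Yn, Zn): XYZ of RGB = (1, 1, 1).
WHITE = tuple(sum(row) for row in RGB_TO_XYZ)
THRESHOLD = 0.008856


def f(t):
    """Piecewise helper from (4)."""
    if t > THRESHOLD:
        return t ** (1.0 / 3.0)
    return 7.787 * t + 16.0 / 116.0


def rgb_to_lab(r, g, b):
    """Convert one linear RGB pixel in [0, 1] to (L*, a*, b*) via (5)-(7)."""
    x, y, z = (row[0] * r + row[1] * g + row[2] * b for row in RGB_TO_XYZ)
    xn, yn, zn = WHITE
    y_ratio = y / yn
    if y_ratio > THRESHOLD:
        lightness = 116.0 * y_ratio ** (1.0 / 3.0) - 16.0
    else:
        lightness = 903.3 * y_ratio
    a_star = 500.0 * (f(x / xn) - f(y_ratio))
    b_star = 200.0 * (f(y_ratio) - f(z / zn))
    return lightness, a_star, b_star


white_lab = rgb_to_lab(1.0, 1.0, 1.0)  # lightness near 100, a* and b* near 0
```

Only the a* and b* channels carry chromatic information, which is why clustering is often run on those two channels after a conversion of this kind.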


Afterward, the k-means clustering method is applied to segment the images, which separates the
diseased regions of the leaf from the healthy ones. The features extracted from the segmented regions were
used to train and test the classifiers. Five well-established, top-performing classifiers were chosen from
among a broad range of candidates: KNN, AdaBoost, logistic regression (LR), RF, and DT. Each candidate
classifier's performance is compared using several evaluation metrics to narrow the field and concentrate on
the optimal solution. During performance analysis, accuracy alone is not a reliable way to determine how
well a classifier performs, because it may not be suited to examining classification patterns on imbalanced
data sets. We therefore also use several other performance evaluation metrics for classifier performance
analysis [17], [18]. Results from a two-class classification method can be described as true positives (TP),
true negatives (TN), false positives (FP), or false negatives (FN). For multiclass classification, however,
the confusion matrix R can be expressed as (8):


$$R = [e_{ij}]_{N \times N} \tag{8}$$

As (8) shows, R is a square matrix with N rows and N columns, where N > 2 is the number of classes, so R
contains $N^2$ entries in total. For class i, the quantities can be computed as (9)-(12):


$$TP_i = e_{ii} \tag{9}$$

$$FP_i = \sum_{j=1,\, j \ne i}^{N} e_{ji} \tag{10}$$

$$FN_i = \sum_{j=1,\, j \ne i}^{N} e_{ij} \tag{11}$$

$$TN_i = \sum_{j=1,\, j \ne i}^{N} \sum_{k=1,\, k \ne i}^{N} e_{jk} \tag{12}$$
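
The per-class reduction in (9)-(12) can be sketched as below, assuming the confusion matrix is stored as a list of rows with actual classes along the rows and predicted classes along the columns. The function name `per_class_counts` and the 3x3 example matrix are hypothetical.

```python
def per_class_counts(R, i):
    """Return (TP, FP, FN, TN) for class i of a square confusion matrix R."""
    n = len(R)
    tp = R[i][i]
    fp = sum(R[j][i] for j in range(n) if j != i)  # column i, off-diagonal
    fn = sum(R[i][j] for j in range(n) if j != i)  # row i, off-diagonal
    tn = sum(R[j][k] for j in range(n) if j != i for k in range(n) if k != i)
    return tp, fp, fn, tn


# Hypothetical 3-class confusion matrix, for illustration only.
R = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 2, 7],
]
counts_class_0 = per_class_counts(R, 0)
```

The four counts always add up to the total number of samples, which is a quick sanity check on any implementation.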


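After this procedure, the matrix R is reduced to 2x2 dimensions for each class. From this result,
accuracy, sensitivity (true positive rate, TPR), specificity (true negative rate, TNR), false positive rate
(FPR), false negative rate (FNR), precision, and error rate are calculated as (13)-(19):


$$\text{Accuracy} = \left( \frac{TP + TN}{TP + TN + FP + FN} \times 100 \right)\% \tag{13}$$

$$TPR = \left( \frac{TP}{TP + FN} \times 100 \right)\% \tag{14}$$

$$TNR = \left( \frac{TN}{TN + FP} \times 100 \right)\% \tag{15}$$

$$FPR = \left( \frac{FP}{FP + TN} \times 100 \right)\% \tag{16}$$

$$FNR = \left( \frac{FN}{FN + TP} \times 100 \right)\% \tag{17}$$

$$\text{Precision} = \left( \frac{TP}{TP + FP} \times 100 \right)\% \tag{18}$$

$$\text{Error Rate} = \left( \frac{FP + FN}{TP + TN + FP + FN} \times 100 \right)\% \tag{19}$$


After cross-validation, we used the receiver operating characteristic (ROC) curve together with the
metrics in (13)-(19) to compare the classifiers. Finally, we select the most suitable classifier.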
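
The seven metrics in (13)-(19) follow directly from the four counts. The sketch below is a minimal illustration; the function name `classification_metrics` and the example counts are hypothetical, and the code assumes that none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return the seven percentage metrics of (13)-(19) as a dict."""
    total = tp + tn + fp + fn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "tpr": 100.0 * tp / (tp + fn),
        "tnr": 100.0 * tn / (tn + fp),
        "fpr": 100.0 * fp / (fp + tn),
        "fnr": 100.0 * fn / (fn + tp),
        "precision": 100.0 * tp / (tp + fp),
        "error_rate": 100.0 * (fp + fn) / total,
    }


# Hypothetical counts for one class.
example = classification_metrics(tp=5, tn=16, fp=2, fn=1)
```

By construction, TPR and FNR sum to 100%, as do TNR and FPR, and accuracy and error rate.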


3.2.3. Feature extraction with gray-level co-occurrence matrix 

We used image processing to extract several statistical and GLCM features that help us identify
diseases in cauliflower. The selected statistical features are the standard deviation ($\sigma$), mean ($\mu$),
variance ($\sigma^2$), skewness ($\gamma$), and kurtosis ($\kappa$) [19]. Suppose there are n pixels in the
affected region(s), $I_i$ is the gray-scale intensity of pixel i, and $\mu$, $I_m$, and $\sigma$ are the mean,
mode, and standard deviation of the gray-scale intensity of all pixels, respectively. The related equations of
these features are then (20)-(24):


$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (I_i - \mu)^2} \tag{20}$$

$$\mu = \frac{1}{n} \sum_{i=1}^{n} I_i \tag{21}$$

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (I_i - \mu)^2 \tag{22}$$

$$\gamma = \frac{\mu - I_m}{\sigma} \tag{23}$$

$$\kappa = \frac{\frac{1}{n} \sum_{i=1}^{n} (I_i - \mu)^4}{\left( \frac{1}{n} \sum_{i=1}^{n} (I_i - \mu)^2 \right)^2} \tag{24}$$
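
A minimal sketch of the five statistical features follows. It assumes the affected region is given as a flat list of gray-scale intensities and that skewness is the mode-based (Pearson) form, mean minus mode divided by standard deviation, which is an assumption based on the mean, mode, and standard deviation named in the text. The function name `statistical_features` is illustrative, and a constant region (zero variance) is not handled.

```python
from collections import Counter
from math import sqrt


def statistical_features(pixels):
    """Compute mean, variance, std, mode-based skewness, and kurtosis."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    std = sqrt(variance)
    mode = Counter(pixels).most_common(1)[0][0]  # most frequent intensity
    fourth_moment = sum((p - mean) ** 4 for p in pixels) / n
    return {
        "mean": mean,
        "variance": variance,
        "std": std,
        "skewness": (mean - mode) / std,      # assumed Pearson mode skewness
        "kurtosis": fourth_moment / variance ** 2,
    }


features = statistical_features([1, 2, 2, 3])
```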


The GLCM can also be thought of as a gray-level spatial dependence matrix. Each entry p(i, j)
indicates how often a pixel with gray level i co-occurs horizontally with a pixel with gray level j.


$$\text{Contrast} = \sum_{i,j} |i - j|^{2}\, p(i, j) \tag{25}$$

$$\text{Correlation} = \sum_{i,j} \frac{(i - \mu_i)(j - \mu_j)\, p(i, j)}{\sigma_i \sigma_j} \tag{26}$$

$$\text{Energy} = \sum_{i,j} p(i, j)^{2} \tag{27}$$

$$\text{Homogeneity} = \sum_{i,j} \frac{p(i, j)}{1 + |i - j|} \tag{28}$$

$$\text{Entropy} = -\sum_{i} p(x_i) \log_2 p(x_i) \tag{29}$$
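
The GLCM features (25)-(29) can be sketched with plain Python lists. This is an illustrative sketch under stated assumptions: the image is a list of rows of integer gray levels, only the horizontal neighbour at distance 1 is counted, and entropy is computed from any probability list passed in. The names `glcm_horizontal`, `glcm_features`, and `entropy` are hypothetical, and the correlation term assumes non-zero standard deviations.

```python
from math import log2, sqrt


def glcm_horizontal(image, levels):
    """Normalized co-occurrence matrix for horizontally adjacent pixels."""
    counts = [[0] * levels for _ in range(levels)]
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in r] for r in counts]


def glcm_features(p):
    """Contrast, correlation, energy, and homogeneity of a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sd_i = sqrt(sum((i - mu_i) ** 2 * v for i, _, v in cells))
    sd_j = sqrt(sum((j - mu_j) ** 2 * v for _, j, v in cells))
    return {
        "contrast": sum(abs(i - j) ** 2 * v for i, j, v in cells),
        "correlation": sum((i - mu_i) * (j - mu_j) * v for i, j, v in cells)
        / (sd_i * sd_j),
        "energy": sum(v ** 2 for _, _, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
    }


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    return -sum(q * log2(q) for q in probabilities if q > 0)


p = glcm_horizontal([[0, 0, 1], [1, 1, 0]], levels=2)
texture = glcm_features(p)
```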


This subsection explains how the extracted features are applied to disease identification in cauliflower.
First, the extracted features were used as input, and a training set and a test set were made from them. Next,
the ten features were ranked. The training set was then balanced using the synthetic minority oversampling
technique (SMOTE), and an ML model was applied to both the training and testing data. Finally, performance
evaluation metrics were used to assess the efficacy of every classifier. Figure 2 depicts all the procedures
followed at this stage. The following subsections explain each step in detail.


Figure 2. Working procedure of our proposed work 


3.3. Feature ranking using mutual information

Different approaches are used for feature selection, such as the ANOVA correlation coefficient and
the mutual information-based method [20]. Because this research uses numerical inputs and a categorical
output, we ranked the features used to diagnose cauliflower disease by their mutual information with the
target variable. The mutual information between two variables $P_1$ and $P_2$ [21] is given in (30), where
$r(l_1, l_2)$ is the joint probability distribution function and $r(l_1)$ and $r(l_2)$ are the corresponding
marginal distributions.


$$I(P_1; P_2) = \sum_{l_1 \in P_1} \sum_{l_2 \in P_2} r(l_1, l_2) \log \left( \frac{r(l_1, l_2)}{r(l_1)\, r(l_2)} \right) \tag{30}$$


The mutual information score of each feature with respect to the target variable is presented in Table 2.


Table 2. The mutual information value of features 


Rank   Name of features     Score of mutual information
1      Entropy              0.13769223
2      Mean                 0.109962
3      Standard deviation   0.10836427
4      Contrast             0.1057053
5      Correlation          0.09430378
6      Homogeneity          0.08052413
7      Kurtosis             0.07979244
8      Skewness             0.07792927
9      Variance             0.07106182
10     Energy               0.06997779
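
A ranking of this kind relies on the mutual information in (30). The sketch below estimates it from paired samples of two discrete variables using empirical frequencies and the natural logarithm. Both choices are assumptions: the paper's features are continuous, so they would first need to be binned (or a different estimator used), and the log base is not stated. The function name `mutual_information` is illustrative.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (count / n) * log((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in joint.items()
    )


# A feature that determines the label fully versus one that is independent.
labels = [0, 0, 1, 1]
informative = mutual_information([0, 0, 1, 1], labels)
uninformative = mutual_information([0, 1, 0, 1], labels)
```

Ranking then amounts to sorting the features by this score in descending order and keeping the top N.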


3.4. Choosing the most important N-features and carrying out synthetic minority oversampling technique 

We have chosen the best N features (5 ≤ N ≤ 10) based on the ranking. Using the selected N
features, we divide the dataset again into a training set and a test set. The model is trained using 80% of the
data and then tested using the remaining 20%. After this split, the training set is highly imbalanced.
Imbalanced datasets are not well suited to ML models. Hence, we used SMOTE, which addresses the
imbalance between classes by generating synthetic samples of the minority class.

3.5. Splitting up datasets

The extracted features of the dataset are divided into a train set comprising 80% of the data and a 
test set comprising 20% of the data. A significant proportion of the data is employed to train our model with 
this division. Afterward, the efficacy of the model is evaluated employing various classifiers on the test set. 
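
The 80/20 split and the SMOTE balancing step can be sketched as follows. This is a simplified illustration rather than the authors' code: the helper names `train_test_split` and `smote`, the fixed seeds, and the neighbour count `k` are assumptions, and the SMOTE helper only generates synthetic points for a single minority class that contains at least two samples.

```python
import random


def train_test_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle indices and split into training and test portions."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    cut = int(round(len(order) * (1 - test_fraction)))
    train, test = order[:cut], order[cut:]
    return (
        [samples[i] for i in train], [labels[i] for i in train],
        [samples[i] for i in test], [labels[i] for i in test],
    )


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority_class, n_new=5, k=2)
```

Oversampling is applied to the training portion only, so that the test set keeps the original class distribution.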


3.6. Selected classifier for cauliflower disease identification 

Five ML classifiers, namely DT, RF, LR, AdaBoost, and KNN, were used to identify diseases in
cauliflower. These classifiers were applied to both the augmented and nonaugmented datasets. Their
primary purpose is to accurately identify the various cauliflower diseases.


4. RESULTS AND DISCUSSION 

Since farmers' smartphones and other hand-held devices will be used to take the sample images, the
data fed into our model will differ in size, viewing angle, and asymmetry. We therefore resized each original
user-submitted image so that the system works for people in a wide range of situations and backgrounds.
The image is scaled down to a standard size of 300x300 pixels; we considered the wide variety of mobile
device forms before settling on this standard size. Contrast mapping is employed to enhance the clarity of
the images. Then the resized images are segmented into 3 clusters using the k-means clustering algorithm.
Feature extraction is a vital step for image-based classification, and the quality of the segmented images
determines the quality of the extracted features. After the cauliflower images are segmented, the ten features
are derived. The entire process of feature extraction is depicted in Table 3. We measure the performance of
the applied ML algorithms using several performance evaluation metrics: accuracy, true positive rate (TPR),
true negative rate (TNR), FPR, FNR, error rate, and precision. The model's effectiveness in identifying
cauliflower diseases using various feature sets has been evaluated and compared in this study.
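
The 3-cluster k-means segmentation step can be sketched as below. This is a minimal illustration under stated assumptions: each pixel is represented by a tuple of color coordinates (for example its a* and b* values), the centroids are initialized from the first k points so that the run is deterministic, and a fixed number of iterations is used. The function name `kmeans` and the toy points are hypothetical.

```python
def kmeans(points, k=3, iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # assumed deterministic init
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids


# Toy "pixels" forming three well-separated color groups.
pixels = [(0, 0), (10, 10), (20, 20), (0, 1), (10, 11), (20, 21)]
labels, centroids = kmeans(pixels, k=3)
```

After clustering, the cluster whose pixels correspond to the lesion is kept as the segmented region from which the features are extracted.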


Table 3. Extraction procedure of captured cauliflower images
(Columns in the original: selected class, captured image, contrast enhancement, segmented image, feature
extracted value. Only the class and feature values are reproduced here.)

Selected class   Feature extracted value
Bacterial spot   0.132, 0.872, 0.801, 0.977, 24.433, 1.165, 4.423, 526.335, 18.743, 3.951
Downy mildew     2.709, 0.674, 0.348, 0.835, 70.138, 3.554, 9.090, 4.213, 2.519, 1.04
Black rot        1.048, 0.847, 0.273, 0.893, 59.859, 70.778, 4.450, 10.702, 1.990, 0.663
Disease free     2.659, 0.877, 0.353, 0.913, 111.394, 4.284, 10.460, 1.071, 1.199, 0.373


To analyze the effects of feature selection and the augmentation strategy, we took the top ten
features from the ranking and formed four feature sets containing the top 5, 7, 9, and 10 features. After that,
we applied different ML classifiers to the augmented and nonaugmented datasets. Tables 4 and 5 show how
the various ML models perform using these feature sets on both datasets.

Table 4. Performance evaluation of nonaugmented data

Feature set      Model name  Accuracy (%)  TPR (%)  TNR (%)  FPR (%)  FNR (%)  Precision (%)  Error rate (%)
Top 5 features   DT          75.71         53.85    88.64    11.36    46.15    73.68          24.29
                 RF          81.16         59.09    91.49    8.51     40.91    76.47          18.84
                 LR          81.72         89.66    78.13    21.88    10.34    65             18.28
                 AdaBoost    72.31         53.33    88.57    11.43    46.67    80.00          27.69
                 KNN         88.89         73.91    95.92    4.08     26.09    89.47          11.11
Top 7 features   DT          82.19         69.57    88       12       30.43    72.73          17.81
                 RF          85.71         73.91    91.50    8.51     26.09    80.95          14.29
                 LR          87.30         68.75    93.62    6.38     31.25    78.57          12.69
                 AdaBoost    75.76         56.67    91.67    8.33     43.33    85.00          24.24
                 KNN         90.67         80       96       4        20       90.91          9.33
Top 9 features   DT          85.00         94.44    79.69    20.31    5.56     72.34          15.00
                 RF          85.92         75.00    91.49    8.51     25.00    81.81          14.08
                 LR          81.25         90.63    76.56    23.44    9.38     65.91          18.75
                 AdaBoost    70.53         70.21    70.83    29.17    29.79    70.21          29.47
                 KNN         84.93         69.23    93.62    6.38     30.77    85.71          15.07
All features     DT          80.60         59.09    91.11    8.89     40.91    76.47          19.40
                 RF          87.5          79.17    91.67    8.33     20.83    82.61          12.5
                 LR          81.25         90.63    76.56    23.44    9.38     65.91          18.75
                 AdaBoost    67.02         75.86    63.08    36.92    24.14    47.83          32.98
                 KNN         85.14         69.23    93.75    6.25     30.76    85.71          14.86

Table 5. Performance evaluation of augmented data

Feature set      Model name  Accuracy (%)  TPR (%)  TNR (%)  FPR (%)  FNR (%)  Precision (%)  Error rate (%)
Top 5 features   DT          85.29         85.71    84.91    15.09    14.29    84.00          14.71
                 RF          87.88         97.30    82.26    17.74    2.70     76.59          12.12
                 LR          80.85         93.55    74.60    25.40    6.45     64.44          19.15
                 AdaBoost    75.53         75.00    75.93    24.07    25.00    69.77          24.47
                 KNN         82.83         86.05    80.36    19.64    13.95    77.08          17.17
Top 7 features   DT          83.00         84.44    81.82    18.19    15.56    79.17          17.00
                 RF          88.12         93.33    83.93    16.07    6.67     82.35          11.88
                 LR          89.23         82.35    91.67    8.33     17.65    77.78          10.77
                 AdaBoost    75.36         56.67    89.74    10.26    43.33    80.95          24.64
                 KNN         87.13         95.00    81.97    18.03    5.00     77.55          12.87
Top 9 features   DT          88.57         80.95    91.84    8.16     19.05    80.95          11.43
                 RF          87.13         86.54    87.75    12.24    13.46    88.24          12.87
                 LR          90.77         78.95    95.65    4.35     21.05    88.24          9.23
                 AdaBoost    76.47         61.76    91.18    8.82     38.24    87.50          23.53
                 KNN         88.35         91.49    85.71    14.29    8.51     84.31          11.65
All features     DT          85.00         90.24    81.36    18.64    9.75     77.08          15.00
                 RF          88.0          89.80    86.27    13.73    10.20    86.27          12.00
                 LR          90.47         78.95    95.45    4.55     21.05    88.24          9.52
                 AdaBoost    64.95         78.57    59.42    40.58    21.43    44.00          35.05
                 KNN         87.50         91.30    84.48    15.52    8.70     82.35          12.50


When considering nine features (entropy, mean, standard deviation, contrast, correlation,
homogeneity, kurtosis, skewness, and variance), the LR classifier achieved the highest accuracy with
augmented images, 90.77%, with a TPR, TNR, FPR, FNR, precision, and error rate of 78.95%, 95.65%,
4.35%, 21.05%, 88.24%, and 9.23%, respectively. This suggests that the statistical features are preferable to
the GLCM features. We then visualize the accuracy of the applied classifiers for the top-ranked feature sets
as bar diagrams in Figure 3: Figure 3(a) shows the top 5 features, Figure 3(b) the top 7, Figure 3(c) the
top 9, and Figure 3(d) all 10 features.

Finally, we used ROC curves to compare the output quality on the augmented dataset with that on the
nonaugmented dataset and to determine which classifier generated the better output. The ROC curve
comparison between the two sets of data is shown in Figure 4. Better classifier performance is generally
associated with a larger area under the ROC curve [22], [23]. Using the augmented data and the top nine
features, the LR classifier achieves the maximum area under the ROC curve of 93.34%, as shown in
Figure 4(a). In contrast, using the top five features of the nonaugmented data, the DT classifier achieved the
lowest area under the ROC curve, 74.89%, as shown in Figure 4(b).
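
The area under the ROC curve can be computed without drawing the curve, using the equivalent pairwise-ranking formulation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below assumes binary labels (1 for the class of interest, 0 otherwise), so a multiclass problem would be handled one class against the rest; the function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for four samples.
auc = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

Multiplying the result by 100 gives a percentage comparable to the values quoted for Figure 4.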

Comparative analysis is an essential part of research. It helps a researcher find the research gap
and devise a new way to solve the problem efficiently [24]. A large number of research articles are available
on agro-based systems. Table 6 presents a comparative analysis with other existing work.

Figure 3. Accuracy comparison across feature sets: (a) 5 features, (b) 7 features, (c) 9 features,
and (d) all features

Figure 4. Comparison between ROC curves: (a) augmented and (b) nonaugmented


Table 6. Comparative analysis with other existing work 


Related work            Object       Type            Dataset size  Segmentation algorithm  Applied classifier  Accuracy (%)
Sari et al. [7]         Papaya       Detection       50            N/A                     NB                  88.00
Panigrahi et al. [13]   Maize        Classification  3,823         N/A                     RF                  79.23
                                                                                           KNN                 76.16
                                                                                           SVM                 77.56
                                                                                           NB                  77.46
                                                                                           DT                  74.35
Behera et al. [25]      Orange       Classification  N/A           N/A                     SVM                 90.00
Jaisakthi et al. [26]   Grape        Identification  5,675         N/A                     SVM, AdaBoost, RF   Average: 93.03
Pulido et al. [27]      Weed         Recognition     320           K-means clustering      SVM                 90
Mia et al. [28]         Mango        Recognition     8             K-means clustering      SVM                 80
Proposed work           Cauliflower  Recognition     1,920         K-means clustering      DT                  88.57
                                                                                           RF                  87.13
                                                                                           LR                  90.77
                                                                                           AdaBoost            76.47
                                                                                           KNN                 88.35

5. CONCLUSION

Extensive research has been conducted on an agro-medical expert system based on machine vision,
with a focus on cauliflower. We segmented the cauliflower images using the k-means clustering method and
extracted 10 features from them. The mutual information-based selection method was then applied to rank
the features. After selecting the top N features, we used SMOTE to balance the training data and applied
five ML classifiers to train and test on the dataset. We are confident that our system performs well in this
area. Using the top nine features, our model achieved its highest accuracy of 90.77% with an LR classifier,
which indicates strong potential for practical use. In the future, we will do more work on recognizing
cauliflower disease, and a big part of it will be leveraging big data to detect different kinds of cauliflower
disorders. This technology also has broad applicability in agriculture, for example to tomatoes, cucumbers,
carrots, and cabbage.